ABSTRACT

Error masking by majority voting remains an important fault tolerance
technique for realizing highly reliable computer systems for critical
control applications. However, VLSI technology has imposed a relatively
high cost on hardware voter circuits because of their high interconnect
complexity. In this paper we present and analyze a new design for
redundant microcomputer systems which appears well suited for
implementation from VLSI modules. In the proposed design, redundant
computing units (CU's) making up the system communicate with each other
periodically to restore units that may have been disabled by transient
faults. The system is also protected against permanent failures. To
evaluate the proposed approach, a reliability model for triple redundant
periodically self restoring systems is developed in this paper. The
model accounts for both permanent and transient faults during system
operation, as well as the possibility of undetected failures in the
redundant units at the start of the mission.


1. INTRODUCTION

Modular redundancy schemes have been widely used for implementing
hardware fault tolerance in computer systems designed for critical real
time control applications. In the classical "static" voted redundancy
schemes, TMR (triple modular redundancy [1]) and its generalization NMR
[2], failures in individual modules are masked by voter circuits so that
a subsystem stays operational as long as a majority of its redundant
modules continue to operate correctly. Systems employing the "dynamic"
standby sparing scheme [3] have the built-in capability of automatically
detecting module failures and replacing the failed module from a pool of
redundant spares. For the same level of redundancy, a standby sparing
system is generally better for long missions because it can usually stay
operational down to the last correctly operating module. An NMR system
operates correctly only as long as a majority of the redundant modules
are failure free. However, for applications requiring highly reliable
operation for short periods, the NMR system is often better because a
majority of the redundant modules are unlikely to fail over such a short
interval. The standby sparing system is more susceptible to malfunction
in its generally complex fault detection and reconfiguration circuitry,
which can lead to system failure well before all the spares are
exhausted. Several "hybrid" redundancy schemes have been proposed [4],
[5], [6] to combine the advantages of NMR and standby sparing. Although
some of the early space computers used basic TMR [7], state-of-the-art
ultra-reliable systems such as the SIFT [8] and the FTMP [9] employ the
hybrid approach by voting only among the redundant modules that are
believed to be operating correctly; failed modules are discarded from
the vote. Nevertheless, even in these systems, simple error masking by
majority voting remains a primary fault tolerance technique for
achieving highly reliable operation.

In implementing a TMR (or NMR) system, the system designer must decide
how the voting is carried out (by hardware or by software) and where the
voters are placed (what signals or data are to be voted on). These
decisions have a very significant impact on the resulting system
reliability. In systems implemented out of VLSI building blocks,
reliability depends more on the chip count and the interconnect
complexity than on the circuit complexity per se. Unfortunately, voting
requires a large number of interconnections. This results in high
modular complexity because pinout limitations restrict the number of
voter circuits that can be placed on a VLSI chip. Because dedicated
computer systems deployed in control applications are often implemented
from relatively few VLSI modules, the injudicious use of a large number
of hardware voters can actually reduce overall system reliability by
significantly increasing the modular and interconnect complexity of the
system [12].

The placement of the voters in a redundant system also greatly affects
its capability of handling transient failures. The complex VLSI modules
used in implementing present-day systems are generally complex
sequential circuits containing memory. A transient fault in such a
circuit can alter its state and result in continued erroneous operation
until the circuit is correctly resynchronized. Unless the voters help
resynchronize such a disabled module, it has the same effect on the
system as a module with a permanent failure. Thus a buildup of disabled
modules in a subsystem, beyond its fault tolerance capability, can cause
system failure even though hardware resources exist in the system for
continued operation. This can be a major reliability degrading factor
because empirical data has shown that transient failures are far more
frequent than permanent faults [10].

Recent NMR designs have taken two approaches on the question of voter
placement. One approach, used in the space shuttle computers [11], is to
vote only on the outputs obtained from a pool of independently operating
redundant computer systems. Since the redundant units here do not
interact to effect transient upset recovery, this approach is only
suited to relatively short missions and requires a high level of
redundancy for acceptable reliability. The other approach, suggested by
Wakerly [12] and employed in the C.vmp [13], is to use additional voters
on the signal lines connecting the processors and memories as shown in
Figure 1. (The C.vmp was actually implemented with a single
bidirectional voter, but it takes triplicated voters to provide full
single-point failure protection.) In such a design the normal flow of
data between the processors and memories resynchronizes any processor
that experiences a transient upset. However, note the high interconnect
complexity created by the voters because of the large number of
interconnection lines that run between the processor and memory in a
computer system. Address, data, as well as timing and control signals
must all be voted upon and correctly synchronized. The reliability of
C.vmp type architectures has been analyzed by Wakerly [12] with respect
to permanent faults. The analysis indicates an improvement in overall
system reliability only if the modular complexity of the voters is low
as compared to that for the rest of the system. While this is true of
the C.vmp because it has been implemented out of MSI and LSI components,
it does not generally hold for systems implemented in VLSI technology.
Therefore, newer designs such as the SIFT and the FTMP basically employ
the first approach and provide massive redundancy. They do not attempt
to restore units upset by transient faults, but instead discard units
making repeated errors from the vote and assume that such units have
permanently failed. However, some transient upsets may still be
corrected in these systems because partial results are voted on among
the redundant units (explicitly by the software in SIFT, by the hardware
during transfers between module triads in FTMP) and the majority opinion
is used by all processors for further computation.

Figure 1: C.vmp type architecture. (Only a single voter was actually
implemented)

In this paper we present and analyze a new approach wherein a set of
redundant computer systems stay in synchronization by periodically
communicating and restoring each other to the majority consensus state.
Because this consensus voting is done in software, such a scheme
provides protection from transient upsets without increasing the
hardware complexity of the system. We shall show that such periodically
self restoring redundant (PSRR) systems offer a very attractive approach
for implementing TMR, NMR, and hybrid redundancy systems in a VLSI
environment. In Section 3 we present a reliability model for triple
redundant PSRR systems, which accounts for the effects of both transient
and permanent faults. Making reasonable assumptions, closed form
expressions for system reliability and mean time to failure are derived.
These results should prove quite useful to the designer of PSRR systems
because triple redundant systems can often provide the needed
reliability for many applications. More elaborate models for general N
redundant systems have been developed and can be found in [15].

It should be noted that in addition to recovery from transient upsets,
there are other important issues that must be addressed in the
implementation of NMR systems. These include the design of reliable
clocking circuits to provide synchronized clock signals to the redundant
modules. This clock system must itself be redundant and protected
against single-point failures. Issues relating to data I/O from the
system and its interactive consistency are also of critical importance.
These problems are common to all redundant designs and have been
addressed in the literature [16], [17], [18]. Here we shall assume that
a reliable fault tolerant clock can be designed, and that each I/O
interface has voters that monitor the output signals from the redundant
processors and simultaneously provide input signals to all the redundant
units.


2. PSRR SYSTEM DESCRIPTION

A triple redundant PSRR system employs three computing units (CU's)
operating redundantly in synchronization. Each CU has all the
computational capabilities of the desired fault tolerant system and, in
general, is made up of processors, memories and I/O units. System input
is provided simultaneously to all three CU's. System output is taken to
be the majority of the three outputs available from the CU's. The voting
is carried out on each output word.

A CU is said to be operational if its computations are error free. Since
for a computing system correct operation only requires the correct
execution of a set of programs and not necessarily the correct
functioning of all hardware components, the proposed triple redundant
system will be considered operational as long as the correct output can
be recognized from among the three outputs available from the CU's.
Clearly this requires that at least two CU's be operational.
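
The output rule just described (take the majority of the three output
words, and treat the system as failed when no majority exists) can be
sketched in a few lines. This is only an illustration; the function name
`majority_vote` and the use of `None` to signal "no majority" are
assumptions made for the example, not part of the proposed design.

```python
from typing import Optional


def majority_vote(a: int, b: int, c: int) -> Optional[int]:
    """Return the value held by at least two of the three CU outputs.

    Returns None when all three words disagree, i.e. no majority exists
    and the correct output cannot be recognized.
    """
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None


# The second CU delivers a corrupted word; the other two mask the error.
print(majority_vote(0x1F, 0x3F, 0x1F))  # prints 31
```

With only one operational CU left, the three words would generally all
differ and the sketch returns `None`, which corresponds to the failed
system condition.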

In general, failure in a computing system may occur for diverse reasons.
These include improper design, software failures, component failures,
etc. In this paper we shall confine ourselves to system failure
resulting from the failure of hardware components. We shall assume that
the system has been properly designed and that the software is error
free.

The failure of a hardware component may be either permanent or transient
(intermittent) in nature. In the proposed redundant system made up of
three CU's operating in synchronization, a CU may fail and fall out of
step due to errors caused by both transient and permanent component
failures. This is because a transient failure can upset the state of the
CU and result in continued erroneous computations until the system is
restored. A CU that has failed in this manner (due to a transient) will
be considered to have temporarily failed. A CU with a permanent failure
in a hardware component is said to have permanently failed. While
permanently failed CU's obviously cannot be resynchronized with the rest
of the system, if the CU failure is temporary (due to a transient), it
can be brought back into synchronization with the rest. In the proposed
system this is done by the three CU's communicating with each other
periodically to restore failed CU's. To facilitate this communication,
each CU is directly linked to both the other two CU's, as shown in
Figure 2. The restoration program is initiated by a non-maskable
interrupt from an external fault tolerant clocking circuit (such as the
one described in [16]) and is executed by each CU out of read only
memory (ROM). This ensures that a transient failure that may have
corrupted the memory in a CU does not prevent it from participating in
the restoration process. During the restoration interval the three CU's
vote on their entire memory contents and also on their processor states,
replacing disagreeing words with the majority opinion. Thus, if over the
restoration interval the system stays operational, that is, two or more
of the CU's operate correctly, then a temporarily failed CU will be
restored provided it does not experience additional failures during this
period.

Figure 2: Triple redundant PSRR system
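
The word by word restoration vote can be sketched as follows. This is a
simplified, centralized illustration: in the proposed system each CU
runs the restoration program itself from ROM and exchanges words over
its links, whereas here the three memory images are plain Python lists
and the name `restore` is hypothetical.

```python
from typing import List


def restore(memories: List[List[int]]) -> int:
    """Vote word by word over three memory images, in place.

    Each disagreeing word is overwritten with the majority value.
    Returns the number of words that were corrected. Words on which
    all three CUs disagree are left untouched (no majority exists).
    """
    assert len(memories) == 3, "sketch covers the triple redundant case"
    cu0, cu1, cu2 = memories
    corrected = 0
    for addr in range(len(cu0)):
        a, b, c = cu0[addr], cu1[addr], cu2[addr]
        if a == b or a == c:
            majority = a
        elif b == c:
            majority = b
        else:
            continue  # three-way disagreement: nothing to restore to
        for cu in memories:
            if cu[addr] != majority:
                cu[addr] = majority
                corrected += 1
    return corrected


# CU 1 suffered a transient that corrupted two words of its RAM.
images = [[7, 8, 9, 10], [7, 0, 9, 99], [7, 8, 9, 10]]
print(restore(images), images[1])  # prints: 2 [7, 8, 9, 10]
```

The processor state (registers, program counter) would be voted on in
the same way by treating it as a few additional words.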

Operation of the redundant system is broken up into computing intervals,
when the system is performing useful computation, and restoration
intervals, when temporarily failed processors are being restored
(Figure 3). The length of the restoration interval is determined by the
time required to execute the restoration program and is a function of
the memory size of each CU. As discussed below, it is estimated that the
restoration interval will be quite small, typically less than a second
for most systems. The length of the computation interval is a design
option and determines the computation-restoration (C-R) cycle time.
Shorter C-R cycle times imply more frequent system restoration. This
increases system reliability by reducing the probability of system
failure due to an accumulation of temporarily failed CU's.

Figure 3: PSRR system operation

At first it may appear that a word by word vote on the entire memory
contents is not very attractive, particularly for real time control
applications, because the system is not available for control purposes
during this time. This drawback may indeed disqualify PSRR systems for
some applications. Note however that such dedicated systems often do not
require large memories. For example, the main memory on the SIFT and
FTMP is only 32K words. Further, because the control programs once
developed do not have to be changed, they can be stored in ROM, which
does not have to be voted on. Thus the read/write (RAM) memory size in
such systems can be reduced to only that required to store and
manipulate data and program control. Assuming an average time of ten
microseconds to vote on one memory location, a system can vote on 100K
memory words in one second. Thus the restoration interval for most PSRR
systems will last only a fraction of a second. Even if restoration is
carried out once every few minutes which, as we shall see from the
reliability model in the next section, will usually be often enough, the
performance degradation due to computation time lost during restoration
is negligible. In comparison, the extra delays introduced by the voter
in the C.vmp result in a performance degradation of about 15 percent
[13] as compared to a non redundant system.
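
The arithmetic behind this estimate is easy to reproduce. The sketch
below uses the ten microsecond voting time stated above; the five minute
C-R cycle and the function names are illustrative assumptions.

```python
VOTE_TIME_S = 10e-6  # ten microseconds per memory word, as assumed in the text


def restoration_interval(ram_words: int, vote_time_s: float = VOTE_TIME_S) -> float:
    """Seconds needed to vote once on every read/write memory word."""
    return ram_words * vote_time_s


def overhead_fraction(ram_words: int, cycle_time_s: float) -> float:
    """Fraction of each C-R cycle spent restoring rather than computing."""
    return restoration_interval(ram_words) / cycle_time_s


# 32K words of RAM (the SIFT/FTMP main memory size quoted above),
# restored once every five minutes (an illustrative choice).
words = 32 * 1024
print(restoration_interval(words))          # about 0.33 s
print(overhead_fraction(words, 5 * 60.0))   # about 0.0011, i.e. 0.11 percent
```

Under these assumptions the lost computation time is roughly two orders
of magnitude smaller than the 15 percent figure quoted for the C.vmp.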

The triple redundant PSRR system described above can be generalized to
an N redundant system in a straightforward way. The latter will consist
of N fully connected CU's with the periodic restoration carried out by a
majority vote on the N system states. In practice, the reliability of N
redundant PSRR systems can be still further improved by modifying the
restoration process so as to discard CU's that are repeatedly in a
minority (presumed to have permanently failed) from the vote. Such a
"self purging" N redundant PSRR system would incorporate the advantages
of dynamic redundancy into the PSRR concept. Robust distributed
restoration algorithms for such self purging systems are currently under
investigation.

3. RELIABILITY MODEL

The reliability R(t) of a fault tolerant system to time t is defined to
be the probability that the system stays operational up to time t, given
that it was completely free of failures at time t = 0.

For the purpose of this analysis, we shall assume that the restoration
interval is negligibly short and that no failure takes place during the
restoration process. This assumption is justified later in this section.
The assumption implies that at the instant just after restoration, if
the system is operational, it does not contain a temporarily failed CU.
At any such instant, therefore, the system must be in one of the
following three states:

State 1: All CU's operational.
State 2: Exactly two CU's operational, the third having permanently
failed.
State 3: Failed system due to more than one failed CU. (This may result
from any combination of temporary and permanent CU failures.)

Using these states we can model the operation of the proposed redundant
system as a three state discrete parameter Markov chain [19]. The state
transition probabilities over one C-R cycle time form the one step
matrix of transition probabilities,

\[
T = \begin{bmatrix}
p_{11} & p_{12} & p_{13} \\
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33}
\end{bmatrix}
\]

It can be readily seen that some of the elements of T are trivial.
$p_{21} = 0$ because once the system has a CU with a permanent failure
(state 2), it can never go to a state with all operational CU's
(state 1). Also $p_{31} = 0$, $p_{32} = 0$ and $p_{33} = 1$, because if
the system fails (state 3) it can never recover to an operational state.

Making these entries in T we find that the matrix of transition
probabilities is upper triangular.

\[
T = \begin{bmatrix}
p_{11} & p_{12} & p_{13} \\
0 & p_{22} & p_{23} \\
0 & 0 & 1
\end{bmatrix}
\]

The five remaining unknown elements in T depend on the CU failure
probabilities and the C-R cycle time for the system. From these
parameters, the remaining elements can be evaluated as follows.

Let $p_t$ be the probability of a temporary CU failure occurring during
a C-R cycle. Let $q_t = (1 - p_t)$ be the probability that such a
failure does not occur. If a constant transient failure rate
$\lambda_t$ is assumed, then $p_t = (1 - e^{-\lambda_t t_0})$, where
$t_0$ is the C-R cycle time. We shall assume that $p_t$ is the same
whether or not the CU is operational.

Similarly let $p_p$ be the probability that a permanent CU failure
occurs over a C-R cycle; $q_p = (1 - p_p)$ is the probability that such
a failure does not occur. If $\lambda_p$ is a constant permanent failure
rate over a C-R cycle, then $p_p = (1 - e^{-\lambda_p t_0})$. It is
assumed that the occurrences of temporary and permanent failures are
mutually independent.

In the matrix of transition probabilities T, $p_{11}$ is the probability
that a failure free system will still be failure free after one C-R
cycle. Clearly this requires that no permanent failure and at most one
temporary failure occurs over this interval. (Since we always consider
the state of the system at the instant just after restoration, a single
temporary failure will always be restored.) Therefore

\[ p_{11} = q_p^3 \, (q_t^3 + 3 q_t^2 p_t). \]

By similar reasoning

\[ p_{12} = 3 q_p^2 p_p \times (q_t^3 + q_t^2 p_t) = 3 q_p^2 p_p \, q_t^2. \]

Also, since $p_{11} + p_{12} + p_{13} = 1$,

\[ p_{13} = 1 - p_{11} - p_{12}. \]

$p_{22}$ is the probability that a state with one permanently failed CU
is retained over a C-R cycle. This requires that no additional failures
take place in the two operational CU's. Therefore

\[ p_{22} = q_p^2 \, q_t^2. \]

Again, since $p_{22} + p_{23} = 1$,

\[ p_{23} = 1 - p_{22}. \]
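
As a sketch of how the one step matrix might be evaluated numerically
under the constant failure rate assumption, consider the function below.
It assumes the reading $p_{12} = 3 q_p^2 p_p q_t^2$ (one permanent
failure, no transient in either of the two remaining CU's); the function
name and argument names are illustrative, not taken from the paper.

```python
import math
from typing import List


def transition_matrix(lam_t: float, lam_p: float, t0: float) -> List[List[float]]:
    """One-step (one C-R cycle) transition matrix for the three-state model.

    lam_t, lam_p: constant transient and permanent failure rates per hour.
    t0: C-R cycle time in hours.
    """
    p_t = 1.0 - math.exp(-lam_t * t0)   # temporary failure in one cycle
    p_p = 1.0 - math.exp(-lam_p * t0)   # permanent failure in one cycle
    q_t, q_p = 1.0 - p_t, 1.0 - p_p

    p11 = q_p**3 * (q_t**3 + 3 * q_t**2 * p_t)
    p12 = 3 * q_p**2 * p_p * q_t**2     # the two healthy CUs see no transient
    p13 = 1.0 - p11 - p12
    p22 = q_p**2 * q_t**2
    p23 = 1.0 - p22
    return [[p11, p12, p13],
            [0.0, p22, p23],
            [0.0, 0.0, 1.0]]


# Example rates of 0.01 and 0.001 per hour with a half-hour cycle
# (an illustrative parameter choice).
for row in transition_matrix(lam_t=0.01, lam_p=0.001, t0=0.5):
    print(row)
```

Each row sums to one by construction, which is a quick sanity check on
any alternative reading of the transition probabilities.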
Once the one step transition probability matrix T is obtained, using the
theory of Markov chains the transition probabilities over n C-R cycles
can be obtained by evaluating $T^n$.

The reliability of the system R(t) to n C-R cycle times is the
probability that a failure free system is still operational after n C-R
cycles. Since state 1 represents a failure free system and state 3 a
failed system, the (1,3) entry in $T^n$, $p_{13}^{(n)}$, gives the
probability that a system that was initially failure free fails over n
C-R cycles. $(1 - p_{13}^{(n)})$ is the probability that the system is
still operational. Therefore

\[ R(n t_0) = 1 - p_{13}^{(n)}. \]


We next obtain $p_{13}^{(n)}$ and hence $R(n t_0)$ in closed form in
terms of the one step state transition probabilities. This can be done
by taking advantage of the fact that since T is upper triangular, $T^n$
is also upper triangular. The equation for $T^n$ can be written as

\[ T^n = T^{n-1} \, T. \]

Obtaining expressions for $p_{12}^{(n)}$ from both sides of this
identity, and using the fact that each row of a transition matrix sums
to one, we get

\[ 1 - p_{11}^{(n)} - p_{13}^{(n)} = p_{11}^{(n-1)} (1 - p_{11} - p_{13})
   + p_{22} \, (1 - p_{11}^{(n-1)} - p_{13}^{(n-1)}). \]

Noting that for a triangular matrix $p_{11}^{(k)} = (p_{11})^k$ and
$p_{22}^{(k)} = (p_{22})^k$ we get

\[ 1 - (p_{11})^n - p_{13}^{(n)} = (p_{11})^{n-1} (1 - p_{11} - p_{13})
   + p_{22} \, (1 - (p_{11})^{n-1} - p_{13}^{(n-1)}). \]

Rearranging terms gives the recurrence relation

\[ p_{13}^{(n)} - p_{22} \, p_{13}^{(n-1)}
   = 1 - p_{22} - (p_{11})^{n-1} (1 - p_{13} - p_{22}). \]

This equation can be solved using known methods such as the one
described in Liu [20]. The general solution to the recurrence relation
is

\[ p_{13}^{(n)} = 1 - \frac{1 - p_{13} - p_{22}}{p_{11} - p_{22}} (p_{11})^n
   + \frac{p_{12}}{p_{11} - p_{22}} (p_{22})^n. \]

Thus the reliability of the periodically self restoring system,
$R(n t_0) = 1 - p_{13}^{(n)}$, is given by

\[ R(n t_0) = \frac{1 - p_{13} - p_{22}}{p_{11} - p_{22}} (p_{11})^n
   - \frac{p_{12}}{p_{11} - p_{22}} (p_{22})^n. \]
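
A closed form of this kind is easy to check numerically against direct
evaluation of $T^n$. The sketch below assumes the closed form
$R(n t_0) = A (p_{11})^n - B (p_{22})^n$ with
$A = (1 - p_{13} - p_{22})/(p_{11} - p_{22})$ and
$B = p_{12}/(p_{11} - p_{22})$, which follows from the recurrence with
$p_{13}^{(0)} = 0$; the example matrix is arbitrary and the helper names
are illustrative.

```python
from typing import List

Matrix = List[List[float]]


def mat_mul(a: Matrix, b: Matrix) -> Matrix:
    """Plain square matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def reliability_by_power(T: Matrix, n: int) -> float:
    """R(n*t0) = 1 - (1,3) entry of T**n, by repeated multiplication."""
    P = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    for _ in range(n):
        P = mat_mul(P, T)
    return 1.0 - P[0][2]


def reliability_closed_form(T: Matrix, n: int) -> float:
    """Closed form in p11, p12, p13, p22 (requires p11 != p22)."""
    p11, p12, p13 = T[0]
    p22 = T[1][1]
    a = (1.0 - p13 - p22) / (p11 - p22)
    b = p12 / (p11 - p22)
    return a * p11**n - b * p22**n


# Arbitrary upper triangular example whose rows each sum to one.
T = [[0.90, 0.06, 0.04],
     [0.00, 0.80, 0.20],
     [0.00, 0.00, 1.00]]
for n in (0, 1, 5, 20):
    print(n, reliability_by_power(T, n), reliability_closed_form(T, n))
```

The two columns should agree to within floating point rounding for
every n, and both give R = 1 at n = 0.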
Note that this reliability expression is completely general and does not
assume any specific failure distribution (such as the constant failure
rate, exponential distribution) for the CU's. System reliability is
estimated based on the temporary and permanent failure probabilities
over one C-R cycle, and these can have any general time dependence.
However, for simplicity, we do use constant failure rates to obtain the
reliability plots in Figures 4 and 5. In these figures the transient
failure rate $\lambda_t$ is taken to be 0.01 per hour (an average of one
failure every 100 hours) and the permanent failure rate $\lambda_p$ to
be 0.001 per hour. These values are consistent with observations [10]
that permanent faults cause only a small fraction of system failures.


Figure 4 shows plots of reliability versus time for different C-R cycle
times, $t_0$. Figure 5 shows the same plots for short mission times
important in ultra-reliable design. As expected the plots indicate that
reliability improves with more frequent restoration, i.e., as the C-R
cycle time is decreased. Also included in the figures are reliability
plots for an unrestored TMR design with no provisions for restoring
temporarily failed CU's ($t_0 = \infty$), and a hypothetical system that
recovers instantaneously from temporary failures in any one CU as long
as the other two CU's are operational ($t_0 = 0$). These two plots
provide bounds on the reliability of the proposed scheme. If the C-R
cycle time is made very large, in the limit infinite, restoration does
not take place in the proposed scheme before the system fails, and its
reliability is reduced to that of the unrestored TMR design, which has
no provision for restoring failed CU's. On the other hand, as the C-R
cycle time is reduced and approaches zero, recovery from temporary
failures is almost instantaneous and the proposed scheme approaches the
hypothetical system described above. In practice this upper limit on
reliability can never, of course, be reached since it requires that both
the computational interval and the restoration interval be of zero
duration.

Figure 4: Reliability plots for the N = 3 systems

Figure 5: Reliability plots for the N = 3 PSRR system for short missions
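
The lower bound can be illustrated numerically. The sketch below steps
the three state chain for a fixed mission time and several C-R cycle
times, and compares against the standard TMR expression
$3r^2 - 2r^3$ with $r = e^{-(\lambda_t + \lambda_p) t}$. That expression
is the textbook one for a system in which any fault disables a unit for
good; it is an assumption brought in for comparison, and the function
names and the list of cycle times are illustrative.

```python
import math


def psrr_reliability(lam_t: float, lam_p: float, t0: float, n: int) -> float:
    """Reliability after n C-R cycles of length t0 (hours), constant rates."""
    q_t = math.exp(-lam_t * t0)
    q_p = math.exp(-lam_p * t0)
    p_t, p_p = 1.0 - q_t, 1.0 - q_p
    p11 = q_p**3 * (q_t**3 + 3 * q_t**2 * p_t)
    p12 = 3 * q_p**2 * p_p * q_t**2
    p22 = q_p**2 * q_t**2
    # State occupancy after n cycles, starting failure free in state 1.
    s1, s2 = 1.0, 0.0
    for _ in range(n):
        s1, s2 = s1 * p11, s1 * p12 + s2 * p22
    return s1 + s2


def unrestored_tmr_reliability(lam_t: float, lam_p: float, t: float) -> float:
    """Classical TMR: at least two of three units untouched by any fault."""
    r = math.exp(-(lam_t + lam_p) * t)
    return 3 * r**2 - 2 * r**3


mission = 16.0  # hours
for t0 in (0.5, 2.0, 8.0, 16.0):
    cycles = round(mission / t0)
    print(t0, psrr_reliability(0.01, 0.001, t0, cycles))
print("unrestored", unrestored_tmr_reliability(0.01, 0.001, mission))
```

When the cycle time equals the mission time (a single cycle), the model
reduces algebraically to the unrestored TMR expression, which is the
limiting behavior described above.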
Notice in Figures 4 and 5 the substantial improvement in reliability
offered by PSRR systems as compared to unrestored TMR. This is
particularly true of PSRR systems with short C-R cycle time. For
example, over a 16-hour mission time, Figure 5 indicates that the
unrestored TMR system is eight times more likely to fail than a PSRR
system with $t_0 = 0.5$ hours. (Recall that the probability of system
failure is [1 - reliability].) As discussed in the previous section,
such values of C-R cycle time $t_0$ are quite practical for PSRR
systems.


4. MEAN TIME TO FAILURE CALCULATIONS

Another parameter of interest in fault tolerant systems is the mean time
to failure (MTTF). In this section we derive the MTTF for triplicated
PSRR systems.

Let $\mu_{ij}$ be the average number of C-R cycles taken by a system
starting in state i to go to state j for the first time. Then for the
three state triplicated PSRR system under discussion, it takes, on the
average, $\mu_{13}$ C-R cycles for a system with all CU's operational to
fail. The system MTTF is, therefore, given by $\mu_{13} \times t_0$.


To evaluate $\mu_{13}$, consider a system one C-R cycle after starting
in state 1. At this point in time the system may be in any one of the
three system states 1, 2, or 3, with probability $p_{11}$, $p_{12}$ and
$p_{13}$ respectively. Therefore, from this instant the average number
of C-R cycles to system failure is given by
$p_{11}\mu_{13} + p_{12}\mu_{23} + p_{13}\mu_{33}$. Since the system was
in state 1 one C-R cycle before the instant under consideration,

\[ \mu_{13} = 1 + p_{11}\mu_{13} + p_{12}\mu_{23} + p_{13}\mu_{33}. \]

It can be similarly seen that

\[ \mu_{23} = 1 + p_{21}\mu_{13} + p_{22}\mu_{23} + p_{23}\mu_{33}. \]

Noting that $\mu_{33} = 0$ (the average number of C-R cycles to go from
state 3 to itself for the first time is 0) and $p_{21} = 0$, the above
two equations reduce to

\[ \mu_{13} = 1 + p_{11}\mu_{13} + p_{12}\mu_{23} \]

\[ \mu_{23} = 1 + p_{22}\mu_{23} \]

Solving simultaneously we get

\[ \mu_{23} = \frac{1}{1 - p_{22}}, \qquad
   \mu_{13} = \frac{p_{12} + p_{23}}{(1 - p_{11})(1 - p_{22})}. \]

Therefore, the system MTTF is given by

\[ \mathrm{MTTF} = \frac{(p_{12} + p_{23}) \, t_0}{(1 - p_{11})(1 - p_{22})}. \]
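
The MTTF expression can be cross-checked against the identity that the
mean number of cycles to failure equals the sum over n of the
probability of surviving n cycles. The sketch below does this under the
constant failure rate assumption, reading $p_{12}$ as
$3 q_p^2 p_p q_t^2$; the function names, the cycle count used for the
summation, and the chosen $t_0$ are illustrative.

```python
import math
from typing import Tuple


def one_step(lam_t: float, lam_p: float, t0: float) -> Tuple[float, float, float]:
    """Return (p11, p12, p22) for one C-R cycle of length t0 hours."""
    q_t = math.exp(-lam_t * t0)
    q_p = math.exp(-lam_p * t0)
    p_t, p_p = 1.0 - q_t, 1.0 - q_p
    p11 = q_p**3 * (q_t**3 + 3 * q_t**2 * p_t)
    p12 = 3 * q_p**2 * p_p * q_t**2
    p22 = q_p**2 * q_t**2
    return p11, p12, p22


def mttf_hours(lam_t: float, lam_p: float, t0: float) -> float:
    """MTTF = (p12 + p23) * t0 / ((1 - p11) * (1 - p22)), in hours."""
    p11, p12, p22 = one_step(lam_t, lam_p, t0)
    p23 = 1.0 - p22
    return (p12 + p23) * t0 / ((1.0 - p11) * (1.0 - p22))


def mttf_by_summation(lam_t: float, lam_p: float, t0: float,
                      cycles: int = 50_000) -> float:
    """Cross-check: t0 times the sum of survival probabilities R(n*t0)."""
    p11, p12, p22 = one_step(lam_t, lam_p, t0)
    s1, s2, total = 1.0, 0.0, 0.0
    for _ in range(cycles):
        total += s1 + s2            # probability of being alive after n cycles
        s1, s2 = s1 * p11, s1 * p12 + s2 * p22
    return total * t0


print(mttf_hours(0.01, 0.001, 2.0), mttf_by_summation(0.01, 0.001, 2.0))
```

The summation is truncated, so it only approximates the closed form, but
with tens of thousands of cycles the neglected tail is negligible for
the rates used here.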

It should be noted that the above expression for MTTF is valid only for
$t_0 \ll 1/(\lambda_t + \lambda_p)$. This is because in the derivation
of $\mu_{13}$, it is implicitly assumed that it always takes more than
one C-R cycle for the system to fail, an assumption that is not valid
unless $t_0 \ll 1/(\lambda_t + \lambda_p)$. This inequality is satisfied
for the range of values of $t_0$ in Figure 6, which displays a plot of
MTTF versus $t_0$ for $\lambda_t = 0.01$ per hour and
$\lambda_p = 0.001$ per hour.

We have seen in Figure 4 that the reliability of a PSRR system always
improves with more frequent restoration. Therefore, the system MTTF must
also increase with reduced $t_0$. While this is not immediately apparent
from the above expression for MTTF because the state transition
probabilities also depend on $t_0$, it is confirmed by the plot in
Figure 6. Again as expected, the longest MTTF is obtained when $t_0$
approaches zero. As $t_0$ grows large, the system MTTF asymptotically
approaches that for an unrestored TMR system with no provisions for
restoring temporarily failed CU's.

Figure 6: MTTF versus C-R cycle time $t_0$ for the N = 3 PSRR system

Notice in Figure 6 the very substantial improvement in MTTF achieved by
PSRR systems as compared with unrestored TMR. For short C-R cycle times
of up to 2 hours, which are expected for such systems, the MTTF for PSRR
systems is over four times that for unrestored TMR systems.


5. INITIAL FAILURE PROBABILITIES

The reliability of most current fault tolerant systems is defined and
evaluated assuming that it is known with certainty that all redundant
subsystems are fault free at the start of the mission. For practical
systems, this is usually not a valid assumption. Because of the
difficulty of completely testing complex systems, testing procedures
usually establish only to a high probability (fault coverage) that the
system is fault free. Unfortunately even a small possibility of a faulty
subsystem at the start of the mission can very significantly affect
system reliability. Our modelling approach for PSRR systems offers the
important advantage that the possibility of one or more CU failures
before the start of the mission can be incorporated into the reliability
model. We show how this is done in this section.

Let $p_{st}$ be the probability that a CU in the system has failed
temporarily before the start of the mission. Let $p_{sp}$ be the
probability that a CU in the system has failed permanently before
the start of the mission. Then the system must be in one of the
following four states at the start of the mission:

a)  All  CU's  operational 

b)  Two  CU's  operational,  one  temporarily 
failed 

c)  Two  CU's  operational,  one  permanently 
failed 

d)  Failed  system  with  two  or  more  failed 
CU's. 

If we assume that system operation always begins with a
restoration interval (which as before is assumed to be negligibly
short in duration, with no possibility of additional failure
during this interval), then this instantaneous restoration will
always take a system initially in state (b) to state (a).
Therefore, for all practical purposes states (a) and (b) can be
grouped together to form a single state corresponding to state 1
in the Markov model of the previous sections. State (c)
corresponds to state 2 and the failed system state (d) to state 3.

The probability that the system is in state 1 at the start of the
mission is, therefore, given by

$$P_{I1} = (1 - p_{sp})^3 \left[ (1 - p_{st})^3 + 3 p_{st} (1 - p_{st})^2 \right].$$

The probability that the system is initially in state 2 is

$$P_{I2} = 3 p_{sp} (1 - p_{sp})^2 (1 - p_{st})^2.$$

And the probability that the system has failed before the start of
the mission (is in state 3) is given by

$$P_{I3} = 1 - P_{I1} - P_{I2}.$$

The probability that a system with these initial state
probabilities fails before the completion of n C-R cycles is

$$P_{I1}\,p_{13}^{(n)} + P_{I2}\,p_{23}^{(n)} + P_{I3}.$$
Therefore the system reliability to n C-R cycles, with initial
failure probabilities considered, is given by

$$R'(nt_0) = 1 - P_{I1}\,p_{13}^{(n)} - P_{I2}\,p_{23}^{(n)} - P_{I3}.$$
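The initial state probabilities and the reliability expression above can be checked numerically. The following Python sketch is illustrative only: the function names are hypothetical, and the one-cycle transition probabilities p11, p12 and p22 are taken as inputs because their dependence on the failure rates and on the cycle time is derived in the earlier sections. It assumes, as in the Markov model, that state 3 is absorbing and that state 2 cannot return to state 1.

```python
def initial_state_probs(p_st, p_sp):
    """Return (P_I1, P_I2, P_I3) for a triple redundant PSRR system."""
    ok_perm = 1.0 - p_sp   # a CU has no permanent initial failure
    ok_temp = 1.0 - p_st   # a CU has no temporary initial failure
    # State 1: no permanent failure and at most one temporary failure
    # (states (a) and (b) grouped together).
    P_I1 = ok_perm ** 3 * (ok_temp ** 3 + 3.0 * p_st * ok_temp ** 2)
    # State 2: exactly one permanent failure, the other two CUs operational.
    P_I2 = 3.0 * p_sp * ok_perm ** 2 * ok_temp ** 2
    # State 3: everything else (two or more failed CUs).
    P_I3 = 1.0 - P_I1 - P_I2
    return P_I1, P_I2, P_I3


def reliability_after_cycles(n, p11, p12, p22, p_st, p_sp):
    """R'(n * t0): probability of being in state 1 or 2 after n C-R cycles."""
    s1, s2, _ = initial_state_probs(p_st, p_sp)
    for _ in range(n):
        # One C-R cycle of the upper triangular chain.
        s1, s2 = s1 * p11, s1 * p12 + s2 * p22
    return s1 + s2
```

Stepping the chain cycle by cycle gives the same quantity as $1 - P_{I1}\,p_{13}^{(n)} - P_{I2}\,p_{23}^{(n)} - P_{I3}$, since the system is reliable exactly when it has not yet reached state 3.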


In section III, it is shown that

$$p_{13}^{(n)} = 1 + \frac{(p_{23} - p_{13})\,(p_{11})^n - p_{12}\,(p_{22})^n}{(p_{22} - p_{11})}.$$

Also, because T is 3×3 and upper triangular,

$$p_{23}^{(n)} = 1 - p_{22}^{(n)} = 1 - (p_{22})^n.$$


Substituting for $p_{13}^{(n)}$ and $p_{23}^{(n)}$ and simplifying
we get

$$R'(nt_0) = 1 - P_{I1} - P_{I2} - P_{I3} - \frac{(p_{23} - p_{13})}{(p_{22} - p_{11})}\,P_{I1}\,(p_{11})^n + \left[\frac{p_{12}\,P_{I1}}{(p_{22} - p_{11})} + P_{I2}\right](p_{22})^n$$

The effect of possible temporary and permanent initial CU failures
on system reliability is illustrated by the plots in Figure 7,
which were obtained using the above expression. To isolate the
contributions of each of the two types of initial failures and to
also observe their combined effect, four reliability plots are
displayed. These are for a system with $\lambda_t = 0.01$ per
hour, $\lambda_p = 0.001$ per hour, $t_0 = 0.1$ hours and (i) no
possibility of initial CU failures; $p_{st} = 0$, $p_{sp} = 0$,
(ii) the possibility of only temporary initial CU failures;
$p_{st} = 0.05$, $p_{sp} = 0$, (iii) the possibility of only
permanent initial CU failures; $p_{st} = 0$, $p_{sp} = 0.05$, and
(iv) the possibility of both temporary and permanent initial CU
failures; $p_{st} = 0.05$, $p_{sp} = 0.05$.

Figure 7: Reliability plots displaying the
effect of initial CU failure
probabilities.
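As a rough numerical illustration of the four cases, the Python sketch below evaluates the closed-form reliability for each pair of initial failure probabilities. The function and variable names are hypothetical, and the transition probabilities used in the usage example are placeholder values chosen only to exercise the formula; they are not the values behind Figure 7, which follow from the failure rates and cycle time through the expressions of the earlier sections.

```python
def initial_probs(p_st, p_sp):
    """(P_I1, P_I2) for the triple redundant system."""
    q_t, q_p = 1.0 - p_st, 1.0 - p_sp
    return (q_p ** 3 * (q_t ** 3 + 3.0 * p_st * q_t ** 2),
            3.0 * p_sp * q_p ** 2 * q_t ** 2)


def closed_form_reliability(n, p11, p12, p22, P_I1, P_I2):
    """Closed-form R'(n * t0); requires p11 != p22."""
    p13 = 1.0 - p11 - p12      # rows of T sum to one
    p23 = 1.0 - p22            # state 2 either stays or fails
    P_I3 = 1.0 - P_I1 - P_I2
    d = p22 - p11
    if d == 0.0:
        raise ValueError("closed form requires p11 != p22")
    return (1.0 - P_I1 - P_I2 - P_I3
            - (p23 - p13) / d * P_I1 * p11 ** n
            + (p12 * P_I1 / d + P_I2) * p22 ** n)


# The four (p_st, p_sp) cases considered in the text.
CASES = {"i": (0.0, 0.0), "ii": (0.05, 0.0),
         "iii": (0.0, 0.05), "iv": (0.05, 0.05)}

if __name__ == "__main__":
    p11, p12, p22 = 0.98, 0.015, 0.97   # placeholder values, not from the paper
    for label, (p_st, p_sp) in CASES.items():
        P_I1, P_I2 = initial_probs(p_st, p_sp)
        print(label, [round(closed_form_reliability(n, p11, p12, p22, P_I1, P_I2), 6)
                      for n in (0, 10, 100)])
```

At n = 0 the expression reduces to $P_{I1} + P_{I2}$, so cases (ii) and (iii) start from the same initial reliability of 0.99275; they differ only in how that probability is split between states 1 and 2.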


The plots in Figure 7 show that both temporary and permanent
initial CU failures reduce system reliability at the start of a
mission. However, the possibility of a temporary initial CU
failure does not have any further impact on system reliability
over the rest of the mission. This is because any such failure
would be restored during the first restoration interval. On the
other hand, a permanent initial CU failure will exist in the
system for all time and continue to degrade system reliability.


Notice from the plots that for equally probable temporary and
permanent initial CU failures, the permanent failures degrade the
system more severely. Note also that the possibility of initial CU
failures has a more significant impact on systems that call for
highly reliable operation over short periods. For example, let us
compare the increase in system failure probability due to a 0.05
probability of both temporary and permanent initial CU failures
over that for an initial failure free system. For a mission time
of 96 hours, the probability of system failure is not
significantly increased if only temporary initial CU failures are
allowed, and is increased by a factor of 1.5 if only permanent
initial CU failures are allowed. On the other hand, for a shorter
16 hour mission time, the increase in system failure probability
is much more significant. The 0.05 probability of temporary
initial CU failures alone doubles the probability of system
failure, while the possibility of permanent initial CU failures
increases system failure probability by a factor of 7. The initial
CU failure probabilities have an even more significant impact on
more reliable operation with shorter mission times.

System MTTF can also be derived to take into account the
possibility of initial CU failures. Recall that $\mu_{k3}$ is the
average number of C-R cycles that it takes for a system in state k
to go to state 3 for the first time. At the start of the mission
the system is in state 1, 2 or 3 with probability $P_{I1}$,
$P_{I2}$ and $P_{I3}$ respectively. Therefore, the average number
of C-R cycles required for the system to go to state 3 is given by

$$N'_{steps} = P_{I1}\,\mu_{13} + P_{I2}\,\mu_{23} + P_{I3}\,\mu_{33}.$$

Noting that $\mu_{33} = 0$ and, from section IV,

$$\mu_{13} = \frac{p_{12} + p_{23}}{(1 - p_{11})(1 - p_{22})}, \qquad \mu_{23} = \frac{1}{(1 - p_{22})},$$

we get

$$N'_{steps} = \frac{P_{I1}\,(p_{12} + p_{23})}{(1 - p_{11})(1 - p_{22})} + \frac{P_{I2}}{(1 - p_{22})}.$$

The system MTTF with initial failure probabilities considered is
$MTTF' = N'_{steps} \times t_0$, which is therefore given by

$$MTTF' = \left[\frac{P_{I1}\,(p_{12} + p_{23})}{(1 - p_{11})(1 - p_{22})} + \frac{P_{I2}}{(1 - p_{22})}\right] t_0.$$

Figure 8 shows a plot of MTTF' versus $t_0$ for a system with
$\lambda_t = 0.01$ per hour and $\lambda_p = 0.001$ per hour and
the same four sets of initial CU failure probabilities plotted in
Figure 7. Note that for equally probable temporary and permanent
initial CU failures, the permanent failures more significantly
affect system MTTF. For the range of values of $t_0$ plotted, a
0.05 probability of temporary initial CU failure results in only
about a 1% reduction in system MTTF, while the same probability of
permanent initial CU failure results in about a 12% reduction in
system MTTF. This is to be expected based on the preceding
discussion, which concluded that the possibility of permanent
initial CU failures has a more significant impact on system
reliability.


Figure  8:  MTTF  plots  displaying  the  effect  of 
initial  CU  failure  probabilities 


For the plots in Figures 7 and 8 the probability of temporary and
permanent initial CU failures was taken to be 0.05. While this
value may be quite pessimistic for practical fault tolerant
systems, it was chosen to highlight the effects of the possibility
of initial CU failures on the system. From the form of the
reliability and MTTF expressions it is clear that the observed
trends also hold for more realistic (smaller) initial CU failure
probabilities. It appears then that the possibility of initial CU
failures, particularly permanent initial CU failures, can
significantly degrade the reliability of PSRR systems designed for
highly reliable operation over short mission times. Their effect
on system MTTF is less significant.

6.  DISCUSSION 

In order to maximize reliability, a fault tolerant system is
usually designed with the following objectives. (We assume here
that the software is error free.)

1.  The hardware should be free of design errors and should meet
the operational specifications of the desired system.

2.  The component failure rate for the hardware should be as low
as possible.

3.  The fault tolerance mechanisms built into the system should
protect it against the widest possible range of failures,
particularly from failure modes that are most likely to occur.

It is also usually desirable to have a good quantitative
reliability model for the system so that the risk of failure can
be realistically considered when deploying the system.

Due to the difficulty in completely testing and validating complex
hardware, objective (1) is usually best achieved by implementing
the system out of proven off the shelf building blocks, with
minimal specialized low level design. Such an approach has the
significant additional advantage of low hardware and design costs.
PSRR systems clearly meet this objective. Except for the fault
tolerant clocking circuit (which all hardware redundancy designs
require) no additional specialized hardware is needed. The vote is
conducted by the processors themselves, with communication between
the CU's taking place through conventional I/O ports. Ongoing
research indicates that robust distributed restoration algorithms
can be developed for general N redundant PSRR systems that will
purge out permanently failed CU's from the vote and thereby also
provide the reliability advantages of dynamic redundancy. In
contrast, redundant systems employing concurrent error masking
require voting circuits. The design of the voters for C.vmp type
systems is non-trivial [17] because timing and control signals
between the processors and memories must also be voted on and
correctly synchronized. Dynamic redundancy designs such as SIFT
and FTMP require even more extensive specialized design to
implement the fault detection and reconfiguration mechanisms.
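To make the idea of a vote conducted by the processors themselves concrete, the following Python sketch shows one way a majority vote with restoration could look in software. It is only an assumption-laden illustration, not the restoration algorithm of this paper: it assumes each CU has already exchanged a comparable state value with the others through its I/O ports, and the function name and return convention are hypothetical.

```python
from collections import Counter


def vote_and_restore(states):
    """Majority-vote over the CU states exchanged at a restoration point.

    Returns (restored_states, disagreeing), where every CU is reset to the
    majority value and `disagreeing` lists the indices of CUs that were
    outvoted. Raises ValueError when no strict majority exists, which
    corresponds to the failed system state.
    """
    value, count = Counter(states).most_common(1)[0]
    if count <= len(states) // 2:
        raise ValueError("no majority among CU states: system failed")
    disagreeing = [i for i, s in enumerate(states) if s != value]
    return [value] * len(states), disagreeing
```

For three CUs holding the values 5, 5 and 7, the call `vote_and_restore([5, 5, 7])` restores all three to 5 and reports CU index 2 as outvoted. A CU that is outvoted at every restoration point would be a candidate for purging as permanently failed, although that policy is not modeled here.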

The objective (2) of low failure rates in the hardware is also met
by employing proven components. Further, in a VLSI technology,
failure rates are largely a function of the number of IC's in the
design and their interconnection complexity. To prevent a single
fault from causing system failure, it is usually not desirable to
implement highly reliable redundant systems on a single chip. It
seems optimal from a reliability standpoint to implement an
N-redundant system using N IC's such that no two copies of a
circuit are on the same chip. Figure 2 indicates that PSRR systems
are well suited for such an implementation, with each CU (which is
basically a conventional microcomputer) being realized on a
different chip. A three chip implementation of the C.vmp
architecture in Figure 1 would require more complex IC's and
significantly greater interconnect complexity, probably resulting
in higher failure rates.

Next, let us consider objective (3), which addresses the types of
failure that are handled by the fault tolerance mechanisms. While
also providing protection against permanent faults, PSRR systems
are particularly effective in providing recovery from upsets
caused by transient faults. This is significant because a large
majority of computer system "crashes" are believed to be caused by
transients. Systems that do not provide recovery from such upsets
are very inefficient in their use of redundant hardware. Of
course, PSRR systems, like other redundant systems, remain
vulnerable to simultaneous upsets in multiple units. To provide
protection against such failures, we are currently studying time
redundancy techniques. It appears that these techniques can be
easily integrated into the PSRR approach of periodically voting on
redundant system states.

A major concern in the implementation of ultra reliable systems is
hardware design faults. Because such faults may often only
manifest themselves under unusual combinations of system state,
inputs and environmental conditions, they are difficult to test
for and can rarely be completely eliminated in today's complex
systems. Further, the unreliability introduced in a system by
design faults is difficult to bound. For ultra-reliable designs,
this can render reliability estimation meaningless. Unfortunately,
many of the redundant designs presently deployed do not provide
any protection from design faults because the resulting failures
are likely to simultaneously affect all the redundant units. The
PSRR design offers some improvement in this regard because it can
be implemented almost entirely from proven high volume off the
shelf hardware. This will allow a system designer to obtain the
redundant copies of the hardware from as diverse manufacturing
sources as possible. Such hardware is less likely to have common
mode failures and will thus provide some protection from design
faults.

Finally, as we have shown in this paper for triple redundant
systems, PSRR systems can be nicely modeled for reliability
estimation. This should allow reliability-redundancy (cost)
trade-offs to be computed while PSRR systems are being designed
and facilitate the systematic design of such systems to desired
reliability specifications.


7.  CONCLUDING  REMARKS 

The PSRR design presented in this paper offers a viable and
attractive approach for implementing fault tolerance in VLSI based
microcomputer systems. The fault-tolerance capability of such a
system depends on the level of redundancy employed. In the
proposed design, protection against permanent failures is provided
by error masking, and the purging of permanently failed modules.
Erroneous signals from modules upset by transient failures are
also masked out and the modules periodically restored to the
correct operational state so as to allow the system to withstand
further failures.